from IPython.display import Image
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Image(url= "datascience.png", width=800, height=800)
Access to safe drinking water is essential to health, a basic human right, and a component of effective health-protection policy. It is important as a health and development issue at every level. In some regions it has been shown that investments in water supply and sanitation can yield net economic benefits. Based on the features given, we have to determine whether the water is potable (safe to drink) or not.
1 - POTABLE, 0 - NOT POTABLE
pH is an important parameter in evaluating the acid–base balance of water, and it indicates whether the water is acidic or alkaline. WHO has recommended a maximum permissible pH range of 6.5 to 8.5. The values in the current investigation range from 6.52 to 6.83, which is within the WHO standard.
Hardness is mainly caused by calcium and magnesium salts, which are dissolved from geologic deposits through which water travels. The length of time water is in contact with hardness-producing material helps determine how much hardness there is in raw water. Hardness was originally defined as the capacity of water to precipitate soap, an effect caused by calcium and magnesium.
Water can dissolve a wide range of inorganic and some organic minerals and salts, such as potassium, calcium, sodium, bicarbonates, and chlorides. Total dissolved solids (TDS) is an important parameter for water use: a high TDS value indicates highly mineralized water. The desirable limit for TDS is 500 mg/L, and the maximum limit prescribed for drinking purposes is 1000 mg/L.
Chlorine and chloramine are the major disinfectants used in public water systems. Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 milligrams per liter (mg/L, or 4 parts per million, ppm) are considered safe in drinking water.
Sulfates are naturally occurring substances present in minerals, rock, and salts. The sulfate concentration in seawater is about 2,700 milligrams per liter (mg/L).
Pure water is not a good conductor of electricity; rather, it is a good insulator. Generally, the amount of dissolved salts determines the conductivity of water.
Total organic carbon (TOC) is a measure of the total amount of carbon in organic compounds in pure water. According to the US EPA, TOC should be < 2 mg/L in treated/drinking water and < 4 mg/L in source water used for treatment.
Trihalomethanes (THMs) are chemicals formed when water is treated with chlorine; their concentration depends on the amount of dissolved organic carbon. THM levels up to 80 ppm are considered safe.
1.Potable water has a pH in the range 7–8.
2.The more solids, the lower the purity of the water and hence the lower its potability.
3.The greater the hardness, the lower the potability.
4.Since hardness comes from dissolved minerals such as calcium and magnesium, which are metals, conductivity increases as hardness increases.
5.The more organic carbon, the higher the pH.
6.Turbidity increases the conductivity of water.
7.Conductivity decreases as chloramines increase.
8.The more organic carbon, the lower the potability.
9.The higher the pH, the greater the chance that the water is not potable.
10.If the quantity of trihalomethanes is above 80 ppm, the water is less likely to be potable.
11.The more sulfate present, the lower the potability.
12.The more sulfates, the higher the conductivity.
13.The more total salts, the lower the potability.
14.The more total salts, the higher the conductivity.
15.The more solids, the greater the hardness.
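Once the dataset is loaded, the directional hypotheses above can be checked against the data. A minimal sketch (a small synthetic frame stands in for the CSV, which is not bundled here; with the real `df` loaded, only the last two lines apply) that ranks features by their linear association with Potability, so the sign of each coefficient can be compared with hypotheses 2, 3, 8, 11, and 13:

```python
import pandas as pd
import numpy as np

# Tiny synthetic stand-in for the real water_potability data (illustration only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ph": rng.normal(7, 1.5, 200),
    "Hardness": rng.normal(196, 33, 200),
    "Solids": rng.normal(22000, 8800, 200),
    "Potability": rng.integers(0, 2, 200),
})

# Rank features by strength of (linear) association with the target;
# e.g. "more solids -> less potable" predicts a negative sign for Solids.
corr_with_target = df.corr()["Potability"].drop("Potability")
print(corr_with_target.abs().sort_values(ascending=False))
```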
df=pd.read_csv(r"C:\Users\asus\Downloads\water_potability (1).csv")
df
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 | 0 |
| 1 | 3.716080 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3271 | 4.668102 | 193.681735 | 47580.991603 | 7.166639 | 359.948574 | 526.424171 | 13.894419 | 66.687695 | 4.435821 | 1 |
| 3272 | 7.808856 | 193.553212 | 17329.802160 | 8.061362 | NaN | 392.449580 | 19.903225 | NaN | 2.798243 | 1 |
| 3273 | 9.419510 | 175.762646 | 33155.578218 | 7.350233 | NaN | 432.044783 | 11.039070 | 69.845400 | 3.298875 | 1 |
| 3274 | 5.126763 | 230.603758 | 11983.869376 | 6.303357 | NaN | 402.883113 | 11.168946 | 77.488213 | 4.708658 | 1 |
| 3275 | 7.874671 | 195.102299 | 17404.177061 | 7.509306 | NaN | 327.459760 | 16.140368 | 78.698446 | 2.309149 | 1 |
3276 rows × 10 columns
#reading the same data again into a second DataFrame
data=pd.read_csv(r"C:\Users\asus\Downloads\water_potability (1).csv")
data
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 | 0 |
| 1 | 3.716080 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3271 | 4.668102 | 193.681735 | 47580.991603 | 7.166639 | 359.948574 | 526.424171 | 13.894419 | 66.687695 | 4.435821 | 1 |
| 3272 | 7.808856 | 193.553212 | 17329.802160 | 8.061362 | NaN | 392.449580 | 19.903225 | NaN | 2.798243 | 1 |
| 3273 | 9.419510 | 175.762646 | 33155.578218 | 7.350233 | NaN | 432.044783 | 11.039070 | 69.845400 | 3.298875 | 1 |
| 3274 | 5.126763 | 230.603758 | 11983.869376 | 6.303357 | NaN | 402.883113 | 11.168946 | 77.488213 | 4.708658 | 1 |
| 3275 | 7.874671 | 195.102299 | 17404.177061 | 7.509306 | NaN | 327.459760 | 16.140368 | 78.698446 | 2.309149 | 1 |
3276 rows × 10 columns
df.shape
(3276, 10)
df.head()
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 | 0 |
| 1 | 3.716080 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
df.info()
#the Sulfate column has fewer non-null records, which means there may be
#null values present.
#similarly, the 'ph' and 'Trihalomethanes' columns are not fully populated.
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3276 entries, 0 to 3275 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ph 2785 non-null float64 1 Hardness 3276 non-null float64 2 Solids 3276 non-null float64 3 Chloramines 3276 non-null float64 4 Sulfate 2495 non-null float64 5 Conductivity 3276 non-null float64 6 Organic_carbon 3276 non-null float64 7 Trihalomethanes 3114 non-null float64 8 Turbidity 3276 non-null float64 9 Potability 3276 non-null int64 dtypes: float64(9), int64(1) memory usage: 256.1 KB
df.isnull().sum() #getting the total no of null values.
ph 491 Hardness 0 Solids 0 Chloramines 0 Sulfate 781 Conductivity 0 Organic_carbon 0 Trihalomethanes 162 Turbidity 0 Potability 0 dtype: int64
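Raw null counts are easier to judge as percentages of the total rows. A small sketch (a synthetic frame with missing values mirrors the ph/Sulfate gaps; with the real `df` loaded, only the last two lines apply):

```python
import pandas as pd
import numpy as np

# Synthetic frame with missing values, for illustration only.
df = pd.DataFrame({
    "ph": [7.0, np.nan, 8.1, np.nan],
    "Sulfate": [368.5, np.nan, 330.0, 356.9],
    "Hardness": [204.9, 129.4, 224.2, 214.4],
})

# Express missingness as a percentage of rows rather than a raw count.
null_pct = df.isnull().mean().mul(100).round(2)
print(null_pct)
```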
df.describe()
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 2785.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 2495.000000 | 3276.000000 | 3276.000000 | 3114.000000 | 3276.000000 | 3276.000000 |
| mean | 7.080795 | 196.369496 | 22014.092526 | 7.122277 | 333.775777 | 426.205111 | 14.284970 | 66.396293 | 3.966786 | 0.390110 |
| std | 1.594320 | 32.879761 | 8768.570828 | 1.583085 | 41.416840 | 80.824064 | 3.308162 | 16.175008 | 0.780382 | 0.487849 |
| min | 0.000000 | 47.432000 | 320.942611 | 0.352000 | 129.000000 | 181.483754 | 2.200000 | 0.738000 | 1.450000 | 0.000000 |
| 25% | 6.093092 | 176.850538 | 15666.690297 | 6.127421 | 307.699498 | 365.734414 | 12.065801 | 55.844536 | 3.439711 | 0.000000 |
| 50% | 7.036752 | 196.967627 | 20927.833607 | 7.130299 | 333.073546 | 421.884968 | 14.218338 | 66.622485 | 3.955028 | 0.000000 |
| 75% | 8.062066 | 216.667456 | 27332.762127 | 8.114887 | 359.950170 | 481.792304 | 16.557652 | 77.337473 | 4.500320 | 1.000000 |
| max | 14.000000 | 323.124000 | 61227.196008 | 13.127000 | 481.030642 | 753.342620 | 28.300000 | 124.000000 | 6.739000 | 1.000000 |
df.fillna(df.mean()) #missing value imputation:
#wherever a value is null we fill it with the column mean,
#which is known as missing value imputation.
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.080795 | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 | 0 |
| 1 | 3.716080 | 129.422921 | 18630.057858 | 6.635246 | 333.775777 | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | 333.775777 | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3271 | 4.668102 | 193.681735 | 47580.991603 | 7.166639 | 359.948574 | 526.424171 | 13.894419 | 66.687695 | 4.435821 | 1 |
| 3272 | 7.808856 | 193.553212 | 17329.802160 | 8.061362 | 333.775777 | 392.449580 | 19.903225 | 66.396293 | 2.798243 | 1 |
| 3273 | 9.419510 | 175.762646 | 33155.578218 | 7.350233 | 333.775777 | 432.044783 | 11.039070 | 69.845400 | 3.298875 | 1 |
| 3274 | 5.126763 | 230.603758 | 11983.869376 | 6.303357 | 333.775777 | 402.883113 | 11.168946 | 77.488213 | 4.708658 | 1 |
| 3275 | 7.874671 | 195.102299 | 17404.177061 | 7.509306 | 333.775777 | 327.459760 | 16.140368 | 78.698446 | 2.309149 | 1 |
3276 rows × 10 columns
df #although we have done the imputation, the DataFrame still shows nulls.
#this arises because we didn't pass inplace=True; by default it is False, so fillna returned a copy.
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 | 0 |
| 1 | 3.716080 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3271 | 4.668102 | 193.681735 | 47580.991603 | 7.166639 | 359.948574 | 526.424171 | 13.894419 | 66.687695 | 4.435821 | 1 |
| 3272 | 7.808856 | 193.553212 | 17329.802160 | 8.061362 | NaN | 392.449580 | 19.903225 | NaN | 2.798243 | 1 |
| 3273 | 9.419510 | 175.762646 | 33155.578218 | 7.350233 | NaN | 432.044783 | 11.039070 | 69.845400 | 3.298875 | 1 |
| 3274 | 5.126763 | 230.603758 | 11983.869376 | 6.303357 | NaN | 402.883113 | 11.168946 | 77.488213 | 4.708658 | 1 |
| 3275 | 7.874671 | 195.102299 | 17404.177061 | 7.509306 | NaN | 327.459760 | 16.140368 | 78.698446 | 2.309149 | 1 |
3276 rows × 10 columns
df.fillna(df.mean(),inplace=True)
df
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.080795 | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 | 0 |
| 1 | 3.716080 | 129.422921 | 18630.057858 | 6.635246 | 333.775777 | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | 333.775777 | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3271 | 4.668102 | 193.681735 | 47580.991603 | 7.166639 | 359.948574 | 526.424171 | 13.894419 | 66.687695 | 4.435821 | 1 |
| 3272 | 7.808856 | 193.553212 | 17329.802160 | 8.061362 | 333.775777 | 392.449580 | 19.903225 | 66.396293 | 2.798243 | 1 |
| 3273 | 9.419510 | 175.762646 | 33155.578218 | 7.350233 | 333.775777 | 432.044783 | 11.039070 | 69.845400 | 3.298875 | 1 |
| 3274 | 5.126763 | 230.603758 | 11983.869376 | 6.303357 | 333.775777 | 402.883113 | 11.168946 | 77.488213 | 4.708658 | 1 |
| 3275 | 7.874671 | 195.102299 | 17404.177061 | 7.509306 | 333.775777 | 327.459760 | 16.140368 | 78.698446 | 2.309149 | 1 |
3276 rows × 10 columns
df.isnull().sum()
ph 0 Hardness 0 Solids 0 Chloramines 0 Sulfate 0 Conductivity 0 Organic_carbon 0 Trihalomethanes 0 Turbidity 0 Potability 0 dtype: int64
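Filling with the global column mean is simple but ignores the target. One common refinement, shown here only as a sketch on a toy frame (it is not what this notebook does), imputes each missing value with the mean of its own Potability class:

```python
import pandas as pd
import numpy as np

# Toy frame: one missing Sulfate value in each Potability class.
df = pd.DataFrame({
    "Sulfate": [300.0, np.nan, 400.0, np.nan],
    "Potability": [0, 0, 1, 1],
})

# Fill each missing Sulfate with the mean of its own Potability class
# rather than the global mean.
df["Sulfate"] = df.groupby("Potability")["Sulfate"].transform(lambda s: s.fillna(s.mean()))
print(df)
```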
sns.heatmap(df.corr(),annot=True)
figure=plt.gcf()
figure.set_size_inches(15,10)
#we drop dimensions that are highly correlated with each other:
#if two features correlate at 75% or more, we can drop one of them.
#in our dataset no pair reaches 75%, so no dimension is dropped.
1.None of the columns are strongly correlated with each other.
2.The highest correlation, about 0.076 (7.6%), is between ph and Hardness.
3.Many features also have negative correlations.
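The 75% rule described above can also be applied programmatically. A sketch on synthetic data (two deliberately correlated columns; on the real dataset the resulting list comes back empty, so nothing is dropped):

```python
import pandas as pd
import numpy as np

# Synthetic data: Hardness is built to correlate strongly with ph.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
df = pd.DataFrame({
    "ph": a,
    "Hardness": a * 0.9 + rng.normal(scale=0.1, size=300),
    "Solids": rng.normal(size=300),
})

# List every feature pair whose absolute correlation exceeds the 0.75 threshold.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
pairs = upper.stack()
high = pairs[pairs > 0.75]
print(high)
```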
df.boxplot(figsize=(16,8))
<AxesSubplot:>
1.The median of Solids is around 21000.
2.Most of the points lie above 4500.
3.There are many outlier points.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3276 entries, 0 to 3275 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ph 3276 non-null float64 1 Hardness 3276 non-null float64 2 Solids 3276 non-null float64 3 Chloramines 3276 non-null float64 4 Sulfate 3276 non-null float64 5 Conductivity 3276 non-null float64 6 Organic_carbon 3276 non-null float64 7 Trihalomethanes 3276 non-null float64 8 Turbidity 3276 non-null float64 9 Potability 3276 non-null int64 dtypes: float64(9), int64(1) memory usage: 256.1 KB
df.describe()
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 |
| mean | 7.080795 | 196.369496 | 22014.092526 | 7.122277 | 333.775777 | 426.205111 | 14.284970 | 66.396293 | 3.966786 | 0.390110 |
| std | 1.469956 | 32.879761 | 8768.570828 | 1.583085 | 36.142612 | 80.824064 | 3.308162 | 15.769881 | 0.780382 | 0.487849 |
| min | 0.000000 | 47.432000 | 320.942611 | 0.352000 | 129.000000 | 181.483754 | 2.200000 | 0.738000 | 1.450000 | 0.000000 |
| 25% | 6.277673 | 176.850538 | 15666.690297 | 6.127421 | 317.094638 | 365.734414 | 12.065801 | 56.647656 | 3.439711 | 0.000000 |
| 50% | 7.080795 | 196.967627 | 20927.833607 | 7.130299 | 333.775777 | 421.884968 | 14.218338 | 66.396293 | 3.955028 | 0.000000 |
| 75% | 7.870050 | 216.667456 | 27332.762127 | 8.114887 | 350.385756 | 481.792304 | 16.557652 | 76.666609 | 4.500320 | 1.000000 |
| max | 14.000000 | 323.124000 | 61227.196008 | 13.127000 | 481.030642 | 753.342620 | 28.300000 | 124.000000 | 6.739000 | 1.000000 |
df.boxplot(column=['ph'],figsize=((16,8)))
#here the median pH is 7, which means most of the water has a pH of 7 or near 7.
#there is one outlier with a pH of 14, and we still need to classify whether that water is drinkable or not.
<AxesSubplot:>
1.The 50th percentile (median) of pH is 7.
2.There are many outliers on both sides: some have pH below 4 and some have pH above 10.
df.boxplot(column=['Hardness'],figsize=((16,8)))
<AxesSubplot:>
df.boxplot(column=['Sulfate'],figsize=((16,8)))
<AxesSubplot:>
df.boxplot(column=['Conductivity'],figsize=((16,8)))
<AxesSubplot:>
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3276 entries, 0 to 3275 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ph 3276 non-null float64 1 Hardness 3276 non-null float64 2 Solids 3276 non-null float64 3 Chloramines 3276 non-null float64 4 Sulfate 3276 non-null float64 5 Conductivity 3276 non-null float64 6 Organic_carbon 3276 non-null float64 7 Trihalomethanes 3276 non-null float64 8 Turbidity 3276 non-null float64 9 Potability 3276 non-null int64 dtypes: float64(9), int64(1) memory usage: 256.1 KB
df.boxplot(column=['Turbidity'],figsize=((16,8)))
<AxesSubplot:>
df.boxplot(column=['Trihalomethanes'],figsize=((16,8)))
<AxesSubplot:>
#Chloramines
df.boxplot(column=['Chloramines'],figsize=((16,8)))
<AxesSubplot:>
df['Solids'].describe()
#the mean is 22014
#the 75th percentile is 27332
#the max value is 61227.196008, which is an outlier
#there are many outliers, i.e. excess solids, and excess solids can occur in both good and bad water,
#so removing the outliers would not be a good idea:
#if we removed them we would be left with mostly good water, which would bias the data.
count 3276.000000 mean 22014.092526 std 8768.570828 min 320.942611 25% 15666.690297 50% 20927.833607 75% 27332.762127 max 61227.196008 Name: Solids, dtype: float64
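The "many outliers" observation can be quantified with Tukey's IQR rule, the same rule the boxplot whiskers use. A sketch on synthetic Solids-like data with a few injected extreme values (not the real column):

```python
import pandas as pd
import numpy as np

# Synthetic Solids-like values, plus a handful of injected extremes.
rng = np.random.default_rng(2)
solids = pd.Series(np.clip(rng.normal(22000, 8800, 1000), 300, None))
solids.iloc[:5] = [61000.0, 60500.0, 60000.0, 59500.0, 59000.0]

# Tukey's rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
q1, q3 = solids.quantile(0.25), solids.quantile(0.75)
iqr = q3 - q1
outliers = solids[(solids < q1 - 1.5 * iqr) | (solids > q3 + 1.5 * iqr)]
print(len(outliers), "flagged out of", len(solids))
```

Counting rather than dropping keeps the decision above intact: we can report how many points are extreme without removing them.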
df.head()
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.080795 | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 | 0 |
| 1 | 3.716080 | 129.422921 | 18630.057858 | 6.635246 | 333.775777 | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | 333.775777 | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3276 entries, 0 to 3275 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ph 3276 non-null float64 1 Hardness 3276 non-null float64 2 Solids 3276 non-null float64 3 Chloramines 3276 non-null float64 4 Sulfate 3276 non-null float64 5 Conductivity 3276 non-null float64 6 Organic_carbon 3276 non-null float64 7 Trihalomethanes 3276 non-null float64 8 Turbidity 3276 non-null float64 9 Potability 3276 non-null int64 dtypes: float64(9), int64(1) memory usage: 256.1 KB
#checking the balance of the data:
#how many 0's and 1's there are (0 means non-potable water, 1 means potable water).
#if there is far more non-potable water, the data is imbalanced and a model trained on it will be biased,
#therefore checking the balance is important.
df['Potability'].value_counts()
#there is more non-potable water than potable water:
#0 (non-potable) has a count of 1998,
#1 (potable) has a count of 1278.
0 1998 1 1278 Name: Potability, dtype: int64
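The severity of the imbalance can be summarized with the majority-to-minority ratio. Using the counts reported above (1998 vs 1278); a ratio around 1.56 is mild, which is why a stratified train/test split (used later) is a reasonable remedy:

```python
import pandas as pd

# Class counts from the notebook: 1998 non-potable (0) vs 1278 potable (1).
counts = pd.Series({0: 1998, 1: 1278}, name="Potability")

# Majority/minority ratio and per-class shares as a quick severity check.
ratio = counts.max() / counts.min()
share = counts / counts.sum()
print(round(ratio, 2), share.round(3).to_dict())
```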
df['Conductivity'].value_counts()
333.946197 1
415.624882 1
405.818097 1
399.477358 1
490.821500 1
..
343.417588 1
475.341351 1
449.361679 1
531.602634 1
510.305603 1
Name: Conductivity, Length: 3276, dtype: int64
sns.countplot(df['Potability'])
C:\Users\asus\anaconda3\lib\site-packages\seaborn\_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. warnings.warn(
<AxesSubplot:xlabel='Potability', ylabel='count'>
1.Non-drinkable (non-potable) water is more common.
2.Non-potability depends on various factors such as carbon content, hardness, sulfates, and pH.
3.Hence more of the features act to reduce the quality of the water.
df.describe()
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 3276.000000 |
| mean | 7.080795 | 196.369496 | 22014.092526 | 7.122277 | 333.775777 | 426.205111 | 14.284970 | 66.396293 | 3.966786 | 0.390110 |
| std | 1.469956 | 32.879761 | 8768.570828 | 1.583085 | 36.142612 | 80.824064 | 3.308162 | 15.769881 | 0.780382 | 0.487849 |
| min | 0.000000 | 47.432000 | 320.942611 | 0.352000 | 129.000000 | 181.483754 | 2.200000 | 0.738000 | 1.450000 | 0.000000 |
| 25% | 6.277673 | 176.850538 | 15666.690297 | 6.127421 | 317.094638 | 365.734414 | 12.065801 | 56.647656 | 3.439711 | 0.000000 |
| 50% | 7.080795 | 196.967627 | 20927.833607 | 7.130299 | 333.775777 | 421.884968 | 14.218338 | 66.396293 | 3.955028 | 0.000000 |
| 75% | 7.870050 | 216.667456 | 27332.762127 | 8.114887 | 350.385756 | 481.792304 | 16.557652 | 76.666609 | 4.500320 | 1.000000 |
| max | 14.000000 | 323.124000 | 61227.196008 | 13.127000 | 481.030642 | 753.342620 | 28.300000 | 124.000000 | 6.739000 | 1.000000 |
sns.relplot(x = df.ph, y = df.Conductivity, hue = df.Potability)
#most of the records lie between pH 4 and 8
#water with a pH of 0 is conductive in nature
#0 means non-potable
#1 means potable
#at exactly pH 7 the conductivity is lowest; maybe the water has a high non-metal content such as chlorine,
#or maybe it has low hardness
#there is one outlier with a pH of 14 that is still classified as potable.
<seaborn.axisgrid.FacetGrid at 0x2272d9fc5b0>
sns.relplot(x = df.ph, y = df.Hardness, hue = df.Potability)
<seaborn.axisgrid.FacetGrid at 0x2272d98f0d0>
1.A pH value of 7 has the lowest conductivity and is non-potable.
2.pH 0, which is acidic in nature, has a conductivity near 600 and is non-potable.
3.Water that is potable has a pH range between 5 and 9.
4.The highest conductivity is more than 700, which is non-potable.
5.A pH of 13 is potable, which is an outlier.
sns.relplot(x = df.Sulfate, y = df.Conductivity, hue = df.Potability)
<seaborn.axisgrid.FacetGrid at 0x2272edfefa0>
1.Sulfate less than 150 has a conductivity of 550 and is potable.
2.Most of the water has sulfate between 250 and 400.
3.The lowest conductivity is less than 200, at a sulfate of 340.
4.There is one outlier with sulfate of more than 450, and the water is potable in that case.
Hence sulfates do not contribute much to conductivity.
sns.relplot(x = df.Turbidity, y = df.Conductivity, hue = df.Potability)
<seaborn.axisgrid.FacetGrid at 0x2272ee4bbb0>
1.Turbidity between 3 and 7 covers most of the potable water.
2.The most turbid water is non-potable in nature.
3.A turbidity of 3 has the highest conductivity.
4.Some points with a turbidity of 6 are drinkable.
sns.relplot(x = df.Solids, y = df.Conductivity, hue = df.Potability)
<seaborn.axisgrid.FacetGrid at 0x2272ee95eb0>
1.The data is right-skewed (a long tail toward high Solids values).
2.Most of the solids are in the range 5000 to 30000.
3.Non-potable water has the highest conductivity, more than 700, which is an outlier.
sns.relplot(x = df.Chloramines, y = df.Conductivity, hue = df.Potability)
<seaborn.axisgrid.FacetGrid at 0x2272e1d1100>
1.There are two orange points (potable) that have chloramines of 0.
2.Chloramines of 8 has the highest conductivity, more than 700, which is an outlier.
3.Some orange points (potable) have chloramines of 12 or greater.
4.The lowest conductivity is at a pH value of 7.
sns.relplot(x = df.Organic_carbon, y = df.Potability, hue = df.Potability)
<seaborn.axisgrid.FacetGrid at 0x2272ff7f7f0>
1.Organic_carbon of 0 is potable water, which means it is correctly classified.
2.Organic_carbon of more than 25 is non-potable.
3.Organic_carbon doesn't contribute much to deciding the potability.
sns.relplot(x = df.Trihalomethanes, y = df.Potability, hue = df.Potability)
<seaborn.axisgrid.FacetGrid at 0x22730022490>
sns.relplot(x = df.Sulfate, y = df.Potability, hue = df.Potability)
<seaborn.axisgrid.FacetGrid at 0x2272d7f6fa0>
1.There are more potable water points than non-potable.
2.Sulfate helps decide the potability of water.
3.Sulfate of less than 150 is classified as potable water.
sns.relplot(x = df.Sulfate, y = df.ph, hue = df.Potability)
<seaborn.axisgrid.FacetGrid at 0x22730092280>
1.The blue points (non-potable) are not as widely distributed as the orange points (potable).
2.Even with more sulfate there is an equal chance that the water is potable.
3.Mostly the non-potable (blue) points have a sulfate range between 250 and 400.
4.Even a pH of 13 with sulfate of 360 is classified as potable water.
sns.relplot(x = df.Sulfate, y = df.Conductivity, hue = df.Potability)
<seaborn.axisgrid.FacetGrid at 0x2272d696dc0>
1.Sulfate of 0 has a conductivity of 550.
2.Sulfate of almost 350 is classified as non-potable at both extremes.
3.At 340 there are many orange points arranged in a sequence, i.e. overlapping, which means sulfate of around 340 has the most orange (potable) points.
sns.relplot(x = df.Hardness, y = df.Solids, hue = df.Potability)
<seaborn.axisgrid.FacetGrid at 0x227300c9cd0>
1.Most of the non-potable (blue) points have hardness in the range 150 to 250.
2.The potable water points (orange) are more widely distributed.
3.Hardness greater than 300 has mostly orange points (which means the water is potable).
4.This shows that our hardness hypothesis is wrong.
sns.scatterplot(x=df.ph,y=df.Potability,hue=df.Potability)
plt.show()
#we know that 1 implies potable (good) water
#and 0 implies non-potable (bad) water
#among the potable points only one is an exception: its pH is greater than 13, which should really make it bad water
sns.scatterplot(x=df.Organic_carbon,y=df.Potability,hue=df.Potability)
plt.show()
sns.scatterplot(x=df.Hardness,y=df.Potability,hue=df.Potability)
plt.show()
1.For potable water (orange points) the hardness varies a lot.
2.Even hardness above 300 is classified as potable water.
3.The distribution of potable water is wider than that of non-potable.
sns.scatterplot(x=df.ph,y=df.Organic_carbon,hue=df.Potability)
<AxesSubplot:xlabel='ph', ylabel='Organic_carbon'>
1.Most potable water lies in the pH range 5 to 9.
2.No orange point (potable) has a carbon content of more than 24.
3.A carbon content of 0 with a pH of 5 is potable.
sns.scatterplot(x=df.ph,y=df.Hardness,hue=df.Potability)
<AxesSubplot:xlabel='ph', ylabel='Hardness'>
df.hist(figsize=(20,20))
plt.show()
#every column is normally distributed except the hardness of water.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3276 entries, 0 to 3275 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ph 3276 non-null float64 1 Hardness 3276 non-null float64 2 Solids 3276 non-null float64 3 Chloramines 3276 non-null float64 4 Sulfate 3276 non-null float64 5 Conductivity 3276 non-null float64 6 Organic_carbon 3276 non-null float64 7 Trihalomethanes 3276 non-null float64 8 Turbidity 3276 non-null float64 9 Potability 3276 non-null int64 dtypes: float64(9), int64(1) memory usage: 256.1 KB
sns.distplot(df['ph'])
#it clearly shows that pH is normally distributed
#and it also reflects that most of the water has a pH of 7
C:\Users\asus\anaconda3\lib\site-packages\seaborn\distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='ph', ylabel='Density'>
1.It clearly shows that pH is normally distributed.
2.It also reflects that most of the water has a pH near 7.
sns.pairplot(df,hue='Potability')
<seaborn.axisgrid.PairGrid at 0x227317a0880>
df.isnull().sum()
ph 0 Hardness 0 Solids 0 Chloramines 0 Sulfate 0 Conductivity 0 Organic_carbon 0 Trihalomethanes 0 Turbidity 0 Potability 0 dtype: int64
#missing values were already imputed above with the mean of their respective columns, so these calls are no-ops kept as a safeguard
df['ph'] = df['ph'].fillna(df['ph'].mean())
df['Sulfate'] = df['Sulfate'].fillna(df['Sulfate'].mean())
df['Trihalomethanes'] = df['Trihalomethanes'].fillna(df['Trihalomethanes'].mean())
df.isnull().sum()
ph 0 Hardness 0 Solids 0 Chloramines 0 Sulfate 0 Conductivity 0 Organic_carbon 0 Trihalomethanes 0 Turbidity 0 Potability 0 dtype: int64
#create a pie chart of the class balance
plt.pie(df['Potability'].value_counts(), labels=['0 - Not potable to drink', '1 - Potable to drink'], autopct='%.0f%%')
plt.show()
X=df.drop('Potability',axis=1) #independent variables
y=df['Potability'] #target variable
#train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=42,stratify=df.Potability)
X_train = (X_train - np.mean(X_train)) / np.std(X_train)  # z-score standardization of the training features
X_train
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity |
|---|---|---|---|---|---|---|---|---|---|
| 2018 | -0.007256 | 1.237015 | 1.550126 | 0.324340 | 1.300337 | -0.776008 | -0.247820 | -2.070449 | -0.185324 |
| 2740 | -0.205376 | 0.017734 | 0.081184 | -1.202600 | 1.812376 | -0.067932 | 0.505598 | -1.151825 | 0.692067 |
| 2746 | -0.990369 | -1.151796 | 0.622491 | -0.111131 | -0.139588 | -1.357775 | -0.199626 | -1.886960 | -1.232364 |
| 1468 | -0.552614 | -0.638412 | -1.164521 | -0.023118 | -0.005882 | 0.028698 | 0.431619 | 1.308690 | -0.885252 |
| 1417 | -0.264160 | -0.447205 | 0.284818 | -0.054383 | 0.201693 | 0.635287 | 1.781942 | -0.772217 | 0.323551 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 101 | 0.175256 | 0.813496 | -1.171744 | 1.116818 | -0.005882 | 0.815493 | 1.187441 | 0.481990 | 1.163680 |
| 1734 | -2.153575 | -0.154935 | -0.510590 | 0.370926 | -0.755536 | -0.270021 | -1.433374 | -1.307368 | 1.393805 |
| 394 | 1.120006 | -0.062634 | 0.281038 | 1.095626 | -0.136927 | -1.163126 | 0.689041 | 0.072272 | -0.215785 |
| 2242 | -0.303535 | -0.810680 | 0.184120 | -0.019019 | -0.005882 | 1.052216 | 1.835796 | 0.810608 | 0.255488 |
| 2846 | 0.089569 | -0.525946 | -0.056850 | -0.767844 | -0.005882 | 0.093427 | -0.729811 | 0.841859 | -0.005888 |
2293 rows × 9 columns
X_test= (X_test-np.mean(X_test))/(np.std(X_test)) #note: strictly, the test split should be scaled with the training split's mean/std to avoid leakage
X_test
| | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity |
|---|---|---|---|---|---|---|---|---|---|
| 2137 | 0.488560 | -0.011067 | -0.017259 | -0.703960 | 2.358109 | -0.631020 | -0.992246 | -0.042473 | 0.956347 |
| 775 | 0.016888 | 0.285405 | 0.526661 | 0.142211 | -0.798964 | -0.156840 | 1.233073 | -0.354832 | 0.827611 |
| 1820 | -0.280528 | 0.289723 | -0.226153 | -0.832869 | -0.561461 | 0.421291 | 0.496998 | -0.948525 | -0.318636 |
| 26 | -2.453204 | 0.370292 | 1.257337 | 1.047609 | 1.429718 | 0.247227 | -0.155509 | -2.339706 | 0.292649 |
| 1085 | 0.016888 | -0.159439 | 0.070639 | -0.457384 | -0.984132 | -0.089413 | -0.256831 | -1.028567 | -1.263046 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 497 | 0.373695 | -0.698562 | -1.326854 | 0.818926 | 0.014080 | -0.266926 | -0.522231 | -0.980949 | -0.920554 |
| 1256 | 0.290002 | -0.282866 | -0.328402 | 0.648303 | 1.012164 | 0.665581 | -0.649783 | 2.423836 | 0.177132 |
| 787 | -0.073389 | 0.719360 | -1.394922 | 2.115840 | 0.014080 | 0.389484 | -0.035405 | -0.376195 | -0.350732 |
| 829 | -1.038319 | 0.687186 | -0.044153 | -0.487284 | 0.014080 | -0.114714 | -0.498766 | -1.446299 | 0.816742 |
| 2356 | -1.058102 | -0.440475 | -0.107659 | -0.310095 | 0.534748 | -0.031541 | -0.338794 | 0.117078 | -1.137070 |
983 rows × 9 columns
Which model do you think would be most appropriate, and why?
1-Decision tree
2-Decision tree with hyperparameter tuning
3-Random forest
4-Random forest with hyperparameter tuning
5-Logistic regression (we know it is affected by outliers, but we tried it anyway)
6-AdaBoost
7-AdaBoost on a decision tree with hyperparameter tuning
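Before comparing models, it helps to know the number to beat: with roughly 61% of samples labelled not potable, always predicting 0 already scores about 61%. A minimal sketch on synthetic data with a similar class split (assumes scikit-learn; not the actual Potability data):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in with a ~61/39 class split, similar to Potability
X, y = make_classification(n_samples=1000, weights=[0.61], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=42)

# majority-class baseline: always predict the most frequent label
base = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
print("majority-class baseline accuracy:", base.score(X_te, y_te))
```

Models whose test accuracy does not clear this baseline add little over always predicting "not potable"; the plain decision trees and logistic regression below hover near it.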
#Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report,f1_score
model1 DECISION TREE
model1=DecisionTreeClassifier()
model1=model1.fit(X_train,y_train)
y_pred1=model1.predict(X_test)
confusion_matrix(y_test, y_pred1)
array([[408, 192],
[223, 160]], dtype=int64)
print(confusion_matrix(y_test, y_pred1))
print(classification_report(y_test, y_pred1))
print("f1 score is :",(f1_score(y_test, y_pred1)))
print("The accuracy of model is ",(accuracy_score(y_test, y_pred1)*100))
[[408 192]
[223 160]]
The accuracy of model is 57.78229908443541
precision recall f1-score support
0 0.65 0.68 0.66 600
1 0.45 0.42 0.44 383
accuracy 0.58 983
macro avg 0.55 0.55 0.55 983
weighted avg 0.57 0.58 0.57 983
f1 score is : 0.43537414965986393
#for better performance we apply hyperparameter tuning
#applying hyperparameter tuning using grid search
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
#parameters
model1_a=DecisionTreeClassifier()
criterion=['gini','entropy']
splitter=['best','random']
min_samples_split=[2,4,6,8,10,12,14,16,18,20]
#GridSEARCH
grid=dict(splitter=splitter,criterion=criterion,min_samples_split=min_samples_split)
cv=RepeatedStratifiedKFold(n_splits=10,n_repeats=3,random_state=1)
grid_search_dt= GridSearchCV(estimator=model1_a,param_grid=grid,n_jobs=-1,cv=cv,scoring='accuracy',error_score=0)
grid_search_dt.fit(X_train,y_train)
GridSearchCV(cv=RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1),
error_score=0, estimator=DecisionTreeClassifier(), n_jobs=-1,
param_grid={'criterion': ['gini', 'entropy'],
'min_samples_split': [2, 4, 6, 8, 10, 12, 14, 16, 18,
20],
'splitter': ['best', 'random']},
scoring='accuracy')
print(f"Best: {grid_search_dt.best_score_:.3f} using {grid_search_dt.best_params_}")
means = grid_search_dt.cv_results_['mean_test_score']
stds = grid_search_dt.cv_results_['std_test_score']
params = grid_search_dt.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
print(f"{mean:.3f} ({stdev:.3f}) with: {param}")
Best: 0.609 using {'criterion': 'entropy', 'min_samples_split': 20, 'splitter': 'random'}
0.595 (0.028) with: {'criterion': 'gini', 'min_samples_split': 2, 'splitter': 'best'}
0.575 (0.036) with: {'criterion': 'gini', 'min_samples_split': 2, 'splitter': 'random'}
0.595 (0.029) with: {'criterion': 'gini', 'min_samples_split': 4, 'splitter': 'best'}
0.570 (0.028) with: {'criterion': 'gini', 'min_samples_split': 4, 'splitter': 'random'}
0.592 (0.027) with: {'criterion': 'gini', 'min_samples_split': 6, 'splitter': 'best'}
0.592 (0.031) with: {'criterion': 'gini', 'min_samples_split': 6, 'splitter': 'random'}
0.597 (0.025) with: {'criterion': 'gini', 'min_samples_split': 8, 'splitter': 'best'}
0.588 (0.030) with: {'criterion': 'gini', 'min_samples_split': 8, 'splitter': 'random'}
0.600 (0.028) with: {'criterion': 'gini', 'min_samples_split': 10, 'splitter': 'best'}
0.591 (0.031) with: {'criterion': 'gini', 'min_samples_split': 10, 'splitter': 'random'}
0.606 (0.025) with: {'criterion': 'gini', 'min_samples_split': 12, 'splitter': 'best'}
0.600 (0.026) with: {'criterion': 'gini', 'min_samples_split': 12, 'splitter': 'random'}
0.603 (0.029) with: {'criterion': 'gini', 'min_samples_split': 14, 'splitter': 'best'}
0.596 (0.029) with: {'criterion': 'gini', 'min_samples_split': 14, 'splitter': 'random'}
0.605 (0.026) with: {'criterion': 'gini', 'min_samples_split': 16, 'splitter': 'best'}
0.590 (0.029) with: {'criterion': 'gini', 'min_samples_split': 16, 'splitter': 'random'}
0.607 (0.032) with: {'criterion': 'gini', 'min_samples_split': 18, 'splitter': 'best'}
0.606 (0.033) with: {'criterion': 'gini', 'min_samples_split': 18, 'splitter': 'random'}
0.606 (0.031) with: {'criterion': 'gini', 'min_samples_split': 20, 'splitter': 'best'}
0.600 (0.034) with: {'criterion': 'gini', 'min_samples_split': 20, 'splitter': 'random'}
0.600 (0.030) with: {'criterion': 'entropy', 'min_samples_split': 2, 'splitter': 'best'}
0.577 (0.032) with: {'criterion': 'entropy', 'min_samples_split': 2, 'splitter': 'random'}
0.594 (0.028) with: {'criterion': 'entropy', 'min_samples_split': 4, 'splitter': 'best'}
0.577 (0.030) with: {'criterion': 'entropy', 'min_samples_split': 4, 'splitter': 'random'}
0.601 (0.024) with: {'criterion': 'entropy', 'min_samples_split': 6, 'splitter': 'best'}
0.591 (0.032) with: {'criterion': 'entropy', 'min_samples_split': 6, 'splitter': 'random'}
0.594 (0.030) with: {'criterion': 'entropy', 'min_samples_split': 8, 'splitter': 'best'}
0.588 (0.032) with: {'criterion': 'entropy', 'min_samples_split': 8, 'splitter': 'random'}
0.602 (0.030) with: {'criterion': 'entropy', 'min_samples_split': 10, 'splitter': 'best'}
0.592 (0.031) with: {'criterion': 'entropy', 'min_samples_split': 10, 'splitter': 'random'}
0.604 (0.031) with: {'criterion': 'entropy', 'min_samples_split': 12, 'splitter': 'best'}
0.592 (0.028) with: {'criterion': 'entropy', 'min_samples_split': 12, 'splitter': 'random'}
0.599 (0.030) with: {'criterion': 'entropy', 'min_samples_split': 14, 'splitter': 'best'}
0.599 (0.028) with: {'criterion': 'entropy', 'min_samples_split': 14, 'splitter': 'random'}
0.602 (0.029) with: {'criterion': 'entropy', 'min_samples_split': 16, 'splitter': 'best'}
0.599 (0.026) with: {'criterion': 'entropy', 'min_samples_split': 16, 'splitter': 'random'}
0.605 (0.026) with: {'criterion': 'entropy', 'min_samples_split': 18, 'splitter': 'best'}
0.596 (0.033) with: {'criterion': 'entropy', 'min_samples_split': 18, 'splitter': 'random'}
0.607 (0.027) with: {'criterion': 'entropy', 'min_samples_split': 20, 'splitter': 'best'}
0.609 (0.028) with: {'criterion': 'entropy', 'min_samples_split': 20, 'splitter': 'random'}
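Rather than copying `best_params_` into a fresh classifier by hand (as done below), `GridSearchCV` with the default `refit=True` exposes the winning model, already refit on the full training data, as `best_estimator_`. A small sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)
grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    {'min_samples_split': [2, 10, 20]}, cv=3)
grid.fit(X, y)

best_dt = grid.best_estimator_  # already refit on all of X, y
print(best_dt.get_params()['min_samples_split'])
```

This avoids transcription mistakes between the printed best parameters and the re-instantiated model.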
model2 DECISION TREE WITH HYPER PARAMETER TUNING (WITH ENTROPY)
#grid search reported best parameters {'criterion': 'entropy', 'min_samples_split': 20, 'splitter': 'random'}; min_samples_split=18 scored almost identically and is used here
model1_dt=DecisionTreeClassifier(criterion ='entropy', min_samples_split=18, splitter='random')
model1_dt=model1_dt.fit(X_train,y_train)
y_pred1_dt=model1_dt.predict(X_test)
confusion_matrix(y_test, y_pred1_dt)
array([[396, 204],
[202, 181]], dtype=int64)
print(confusion_matrix(y_test, y_pred1_dt))
print(classification_report(y_test, y_pred1_dt))
print("f1 score is :",(f1_score(y_test, y_pred1_dt)))
print("The accuracy of model is ",(accuracy_score(y_test, y_pred1_dt)*100))
[[396 204]
[202 181]]
precision recall f1-score support
0 0.66 0.66 0.66 600
1 0.47 0.47 0.47 383
accuracy 0.59 983
macro avg 0.57 0.57 0.57 983
weighted avg 0.59 0.59 0.59 983
f1 score is : 0.4713541666666667
The accuracy of model is 58.697863682604265
model3 DECISION TREE WITH HYPERPARAMETER TUNING (WITH GINI)
model1_dt2=DecisionTreeClassifier(criterion ='gini', min_samples_split=18, splitter='random')
model1_dt2=model1_dt2.fit(X_train,y_train)
y_pred1_dt2=model1_dt2.predict(X_test)
confusion_matrix(y_test, y_pred1_dt2)
array([[429, 171],
[236, 147]], dtype=int64)
print(confusion_matrix(y_test, y_pred1_dt2))
print(classification_report(y_test, y_pred1_dt2))
print("f1 score is :",(f1_score(y_test, y_pred1_dt2)))
print("The accuracy of model is ",(accuracy_score(y_test, y_pred1_dt2)*100))
[[429 171]
[236 147]]
precision recall f1-score support
0 0.65 0.71 0.68 600
1 0.46 0.38 0.42 383
accuracy 0.59 983
macro avg 0.55 0.55 0.55 983
weighted avg 0.57 0.59 0.58 983
f1 score is : 0.4194008559201141
The accuracy of model is 58.59613428280773
TRYING LOGISTIC REGRESSION
from sklearn.linear_model import LogisticRegression
model2=LogisticRegression(class_weight='balanced',max_iter=200) #'balanced' (not 'balance') is the valid option
model2=model2.fit(X_train,y_train) #fit the logistic regression itself (the original refit model1 here by mistake)
from sklearn.metrics import confusion_matrix
#to check the working performance of the logistic regression model
y_pred2=model2.predict(X_test)
confusion_matrix(y_test, y_pred2)
array([[405, 195],
[220, 163]], dtype=int64)
print(classification_report(y_test, y_pred2))
print("f1 score is :",(f1_score(y_test, y_pred2)))
print("The accuracy of model is ",(accuracy_score(y_test, y_pred2)*100))
precision recall f1-score support
0 0.65 0.68 0.66 600
1 0.46 0.43 0.44 383
accuracy 0.58 983
macro avg 0.55 0.55 0.55 983
weighted avg 0.57 0.58 0.58 983
f1 score is : 0.43994601889338725
The accuracy of model is 58.59613428280773
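A tidier way to combine the scaling and the classifier is a `Pipeline`, which also keeps the scaler's statistics confined to the training folds during cross-validation. A sketch on synthetic data (assumes scikit-learn; not the actual Potability data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, weights=[0.61], random_state=42)

# scaling happens inside the pipeline, so fit() only sees training data;
# class_weight='balanced' reweights the minority (potable) class
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(class_weight='balanced', max_iter=200))
clf.fit(X, y)
print(clf.score(X, y))  # accuracy of the fitted pipeline on its training data
```

Passing the pipeline itself to `GridSearchCV` or `cross_val_score` then handles the scaling correctly fold by fold.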
USING AN ENSEMBLE MODEL: RANDOM FOREST
from sklearn.ensemble import RandomForestClassifier
model3=RandomForestClassifier()
model3
RandomForestClassifier()
model3=model3.fit(X_train,y_train)
from sklearn.metrics import confusion_matrix
#to check the working performance of the model
y_pred3=model3.predict(X_test)
confusion_matrix(y_test, y_pred3)
array([[529, 71],
[262, 121]], dtype=int64)
print(confusion_matrix(y_test, y_pred3))
print(classification_report(y_test, y_pred3))
print("f1 score is :",(f1_score(y_test, y_pred3)))
print("The accuracy of model is ",(accuracy_score(y_test, y_pred3)*100))
[[529 71]
[262 121]]
The accuracy of model is 66.12410986775178
precision recall f1-score support
0 0.67 0.88 0.76 600
1 0.63 0.32 0.42 383
accuracy 0.66 983
macro avg 0.65 0.60 0.59 983
weighted avg 0.65 0.66 0.63 983
f1 score is : 0.4208695652173914
#hyper parameter tuning
from sklearn.model_selection import GridSearchCV
param_grid = { 'bootstrap': [True], 'max_depth': [5, 10,15, None], 'max_features': ['auto', 'log2'],
'n_estimators': [5, 6, 7, 8, 9, 10, 11, 12, 13, 15,20,25,30,35]}
rfc = RandomForestClassifier(random_state = 1)
g_search = GridSearchCV(estimator = rfc, param_grid = param_grid,
cv = 3, n_jobs = 1, verbose = 0, return_train_score=True)
g_search.fit(X_train, y_train)
print(g_search.best_params_)
{'bootstrap': True, 'max_depth': 15, 'max_features': 'auto', 'n_estimators': 35}
RANDOM FOREST WITH HYPER PARAMETER TUNING
#fitting the model with the parameters obtained
model3_a=RandomForestClassifier(bootstrap= True, max_depth= 15, max_features= 'auto', n_estimators= 35)
model3_a
RandomForestClassifier(max_depth=15, n_estimators=35)
model3_a=model3_a.fit(X_train,y_train)
from sklearn.metrics import confusion_matrix
#to check the working performance of the model
y_pred3_a=model3_a.predict(X_test)
confusion_matrix(y_test, y_pred3_a)
array([[537, 63],
[266, 117]], dtype=int64)
print(confusion_matrix(y_test, y_pred3_a))
print(classification_report(y_test, y_pred3_a))
print("f1 score is :",(f1_score(y_test, y_pred3_a)))
print("The accuracy of model is ",(accuracy_score(y_test, y_pred3_a)*100))
[[537 63]
[266 117]]
precision recall f1-score support
0 0.67 0.90 0.77 600
1 0.65 0.31 0.42 383
accuracy 0.67 983
macro avg 0.66 0.60 0.59 983
weighted avg 0.66 0.67 0.63 983
f1 score is : 0.4156305506216697
The accuracy of model is 66.53102746693794
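Beyond its higher accuracy, the random forest also reports impurity-based feature importances, which would show which of the nine water-quality features drive the prediction. A sketch on synthetic data of the same shape (assumes scikit-learn; on the real data, `model3_a.feature_importances_` can be paired with `X.columns`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=9, n_informative=4,
                           random_state=1)
rf = RandomForestClassifier(max_depth=15, n_estimators=35, random_state=1)
rf.fit(X, y)

# one importance per feature; impurity-based importances sum to 1
order = np.argsort(rf.feature_importances_)[::-1]
for i in order[:3]:
    print(f"feature {i}: {rf.feature_importances_[i]:.3f}")
```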
# AdaBoost classifier
from sklearn.ensemble import AdaBoostClassifier
model6=AdaBoostClassifier()
model6=model6.fit(X_train,y_train)
y_pred6=model6.predict(X_test)
confusion_matrix(y_test, y_pred6)
array([[519, 81],
[294, 89]], dtype=int64)
print("f1 score is :",(f1_score(y_test, y_pred6)))
print("The accuracy of model is ",(accuracy_score(y_test, y_pred6)*100))
f1 score is : 0.321880650994575
The accuracy of model is  61.85147507629705
#AdaBoost with parameter tuning: use the tuned decision tree as the base estimator
model7=AdaBoostClassifier(base_estimator=model1_dt)
model7=model7.fit(X_train,y_train)
y_pred7=model7.predict(X_test)
confusion_matrix(y_test, y_pred7)
array([[428, 172],
[211, 172]], dtype=int64)
print("f1 score is :",(f1_score(y_test, y_pred7)))
print("The accuracy of model is ",(accuracy_score(y_test, y_pred7)*100))
f1 score is : 0.4731774415405777
The accuracy of model is  61.03763987792472
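AdaBoost itself has two knobs worth searching over, `n_estimators` and `learning_rate`, which the runs above leave at their defaults. A hypothetical small grid, sketched on synthetic data (note that in scikit-learn 1.2+ the `base_estimator` argument used above was renamed `estimator`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=42)

# a small, illustrative grid over the two main AdaBoost hyperparameters
param_grid = {'n_estimators': [50, 100], 'learning_rate': [0.5, 1.0]}
grid = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid,
                    cv=3, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_)
```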
dic={'Model':['Decision tree','Decision tree-HPT1','Decision tree-HPT2','Logistic regression',
              'Random Forest','Random Forest-HPT','Adaboost','Adaboost_HPT'],
     'Accuracy':[57.78,58.7,58.6,58.5,66.12,66.53,61.85,61.0],
     'FPR in %':[30,34,25,31.2,9,7.8,10,26]}
results=pd.DataFrame(dic)
results
| | Model | Accuracy | FPR in % |
|---|---|---|---|
| 0 | Decision tree | 57.78 | 30.0 |
| 1 | Decision tree-HPT1 | 58.70 | 34.0 |
| 2 | Decision tree-HPT2 | 58.60 | 25.0 |
| 3 | Logistic regression | 58.50 | 31.2 |
| 4 | Random Forest | 66.12 | 9.0 |
| 5 | Random Forest-HPT | 66.53 | 7.8 |
| 6 | Adaboost | 61.85 | 10.0 |
| 7 | Adaboost_HPT | 61.00 | 26.0 |
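The comparison table reads more easily as a chart. A sketch using the notebook's pandas/matplotlib imports (the `Agg` backend and the output filename are assumptions for running outside the notebook; the accuracies are copied from the table above):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend, an assumption for scripted runs
import matplotlib.pyplot as plt
import pandas as pd

results = pd.DataFrame({
    'Model': ['Decision tree', 'Decision tree-HPT1', 'Decision tree-HPT2',
              'Logistic regression', 'Random Forest', 'Random Forest-HPT',
              'Adaboost', 'Adaboost_HPT'],
    'Accuracy': [57.78, 58.7, 58.6, 58.5, 66.12, 66.53, 61.85, 61.0],
})
ax = results.plot.barh(x='Model', y='Accuracy', legend=False)
ax.set_xlabel('Accuracy (%)')
plt.tight_layout()
plt.savefig('model_comparison.png')
```

The horizontal bars make it immediately visible that the two random forests lead, with the AdaBoost variants next and the single trees and logistic regression near the majority-class baseline.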